Abstract

Valuable data that would strengthen meta-analyses are often presented in graphs without
reported means and standard deviations. The true state of knowledge about investigative
questions is not accurately represented because these data are not included in the analysis.
This  paper  describes  and  evaluates  a  method  for  extracting  estimated  effect  sizes  from 
graphic  presentations.  Two  studies  were  conducted  to  assess  the  reliability  and  accuracy 
of  using  electronic  calipers  to  estimate  effect  sizes  from  bar  and  line  graphs.  The  first 
study  looked  at  the  reliability  of  effect  size  estimates  derived  from  measurements  taken 
from  published  graphs  showing  changes  in  one-repetition  maximum  strength.  The  second 
study  assessed  the  accuracy  of  effect  size  estimates  computed  from  graphs  that  were 
constructed  from  known  means  and  standard  deviations.  The  first  study  demonstrated 
very  high  levels  of  test-retest  and  inter-rater  reliability  for  the  effect  size  estimates.  The 
second  study  showed  a  close  correspondence  between  the  effect  sizes  estimated  from  the 
graphs  and  the  known  effect  sizes  used  to  construct  the  graphs.  Thus,  using  electronic 
calipers  to  estimate  effect  sizes  from  graphs  produces  results  that  are  accurate  and 
reliable.  Meta-analysts  can  confidently  use  this  methodology  to  include  results  that  have 
been  presented  only  graphically. 
Introduction

Meta-analysis  is  a  tool  for  providing  a  systematic,  quantitative  summary  of 
findings  in  a  research  domain.  Meta-analysis  integrates  results  from  different  studies  by 
pooling  effect  sizes  (Cooper  &  Hedges,  1994;  Hedges  &  Olkin,  1985).  This  method  has 
been  used  to  summarize  the  state  of  the  art  in  research  in  several  areas  of  applied 
physiology (Lemura, von Duvillard, & Mookerjee, 2000; Londeree, 1997; Rhea, Alvar,
Burkett,  &  Ball,  2003). 

Standard  practices  in  reporting  applied  physiological  research  findings  pose  some 
problems  for  meta-analysis.  The  use  of  graphs  is  one  such  problem.  To  illustrate, 
consider  the  experimental-control  group  research  design  that  is  typically  used  in  applied 
physiology  studies.  Measurements  are  taken  on  both  groups  before  and  after  introducing 
an  intervention.  For  example,  resistance  training  program  studies  commonly  involve 
assigning  study  participants  to  treatment  and  control  groups  and  obtaining  strength 
measurements before and after the training period for the experimental group. If the study
findings include the means and standard deviations of the pre- and posttraining
measurements,  an  effect  size  suitable  for  use  in  a  meta-analysis  can  be  computed  directly 
from  the  information  given.  The  meta-analysis  problems  arise  when  the  results  are 
reported  only  in  graphs. 

Effect sizes cannot be computed directly from graphs, so the meta-analyst must
either ignore those studies or use the graphs to estimate the appropriate statistics.
Ignoring the studies reduces the body of evidence available to describe the state of the art
in  the  area  of  interest.  This  point  is  important  because  graphic  presentation  forms  a 
significant  proportion  of  the  available  evidence  for  some  reviews  (e.g.,  Galvao  and 
Taaffe,  2004).  In  such  cases,  omitting  the  results  embodied  in  the  graphical 
representations  certainly  reduces  the  precision  of  aggregate  effect  size  estimates  and  may 
also  bias  the  aggregate  estimates. 

A  method  of  estimating  effect  sizes  from  graphs  would  be  a  useful  tool  to  ensure 
that  meta-analyses  accurately  represent  the  state  of  knowledge  within  a  research  domain. 
This  report  evaluates  a  simple  method  of  converting  graphs  to  estimates  of  means  and 
standard  deviations  (SDs).  The  method  relies  on  caliper  measurements  taken  by  one  or 
more  investigators.  Two  questions  about  estimating  effect  sizes  from  graphs  were 
addressed:  (1)  How  reliable  are  the  measurements?  (2)  How  accurate  are  the  estimated 
effect  sizes  derived  from  the  measurements?  Two  studies  were  conducted  to  address 
these  questions.  The  first  study  assessed  the  test-retest  and  inter-rater  reliability  of  effect 
size  estimates  derived  from  published  graphs.  The  second  study  looked  at  the  accuracy  of 
the  measurements  by  comparing  the  graphic  effect  sizes  (computed  from  the 
measurements)  with  the  tabular  effect  sizes  (computed  from  the  reported  means  and 
SDs).  Taken  together,  the  studies  demonstrate  that  the  conversion  of  graphs  to  effect  size 
estimates  is  reliable  and  accurate. 
Study One: Reliability Testing


Methods 

The  authors  of  this  report  selected  the  graphs  and  conducted  the  data  preparation, 
measuring,  coding,  and  analysis. 

Forty-one  graphs  presenting  one-repetition  maximum  (1RM)  strength 
measurements  from  studies  that  employed  pre-test/post-test  designs  to  assess  resistance 
training program effectiveness were selected for analysis from four journals: Journal of
Strength and Conditioning Research, Medicine and Science in Sports and Exercise,
European Journal of Applied Physiology, and Acta Physiologica Scandinavica. The graphs
were  taken  from  14  studies  that  presented  1RM  data  in  graphic  form  only.  The  articles 
are  indicated  in  the  reference  section. 

A  maximum  of  six  pre  and  posttest  measurements  were  taken  from  each  study  to 
ensure  adequate  study  representation.  Differences  in  the  graph  generation  between 
studies  were  expected  to  have  negligible  effects  on  the  measurement  accuracy  from  the 
graphs.  The  graphs  were  extracted  from  a  variety  of  sources  to  ensure  that  the  findings 
had  general  applicability.  Each  measurement  was  taken  from  either  a  bar  or  line  graph 
that  presented  1RM  means  and  SDs/errors.  No  control  group  measurements  were  taken. 
Furthermore,  when  results  were  presented  for  intermediate  outcomes  during  training, 
analyses  were  limited  to  the  initial  and  final  measurements. 

Once  the  graphs  were  selected,  the  following  steps  were  taken  to  obtain  individual 
measurements: 

Step 1. Converted graphs to a workable size while maintaining original
proportions (this can be done on most photocopiers, or in Microsoft Word or
PowerPoint programs).

Step 2. Used the 797B-8/200 electronic caliper (Starrett, Athol, MA).

Step 3. Turned on the calipers, set them to measure in millimeters (mm), and
zeroed them out.

Step 4. Placed the bottom point of the calipers at the center of the baseline
(x-axis); then opened the calipers so the top point was near the center of the bar or
line being measured.

Step 5. Carefully adjusted the caliper until the top point was placed precisely at
the center of the line being measured.

Step 6. Held calipers at approximately a 45 degree angle away from the
investigator to ensure visibility of the graphs when taking measurements.

Step 7. Recorded the digital reading of the distance to the nearest 100th of a mm.¹

¹ The calipers are certified to be accurate within .02 mm for measurements ranging from 0-100 mm and
within .03 mm for measurements ranging from 100-300 mm.

The following steps were taken to compute the estimated effect sizes for bar graphs, line
graphs, and broken line graphs:

1. The bar graph below is used to illustrate the following steps (Figure 1):

Step 1. Measured and recorded mm on y-axis from the baseline (x-axis) to the
maximum  value  (kg)  (A). 

Step  2.  Created  scale  conversion  unit  by  dividing  the  kgs  on  the  y-axis  (100) 
by  the  y-axis  mm. 

Step  3.  Measured  and  recorded  the  total  mm  from  the  center  of  the  baseline 
(x-axis)  to  the  top  of  the  error  bars  (B). 

Step  4.  Measured  and  recorded  the  total  mm  from  the  center  of  the  baseline 
(x-axis)  to  the  top  of  the  bar  representing  the  mean  (C). 

Step  5.  Converted  to  kgs  using  the  conversion  unit  to  arrive  at  the  estimated 
mean. 

Step  6.  Calculated  the  difference  between  the  measurement  to  the  top  of  the 
error  bars  and  the  measurement  for  the  mean  to  get  the  value  representing  the 
SD/error  (B-C). 

Step  7.  Converted  value  from  Step  6  to  kgs  using  the  conversion  unit  to  arrive 
at  the  estimated  SD/error. 

Step  8.  Calculated  effect  size  using  estimated  means  and  SDs. 
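
The bar graph steps above can be sketched in code. This is a minimal illustration, not the authors' procedure or software. The function names, the caliper readings, and the effect size formula (posttest mean minus pretest mean, divided by the pretest SD) are assumptions made for the example; the report states only that effect sizes were calculated from the estimated means and SDs. If a graph displays standard errors rather than SDs, the error value would first need to be converted using the sample size (SD = SE × √n), a step not covered above.

```python
from dataclasses import dataclass


@dataclass
class GraphEstimate:
    mean_kg: float
    error_kg: float  # SD or SE, whichever the graph's error bars show


def scale_conversion(axis_max_kg: float, axis_mm: float) -> float:
    """Step 2: kg represented by one mm of the y-axis (axis_mm is measurement A)."""
    if axis_mm <= 0:
        raise ValueError("axis_mm must be positive")
    return axis_max_kg / axis_mm


def estimate_from_graph(kg_per_mm: float, error_top_mm: float,
                        mean_mm: float) -> GraphEstimate:
    """Steps 3-7: convert caliper readings B (error bar) and C (mean) to kg."""
    mean_kg = mean_mm * kg_per_mm
    # B - C; the absolute value also covers error bars drawn below the mean.
    error_kg = abs(error_top_mm - mean_mm) * kg_per_mm
    return GraphEstimate(mean_kg, error_kg)


def effect_size(pre: GraphEstimate, post: GraphEstimate) -> float:
    """Step 8 (ASSUMED formula): (post mean - pre mean) / pre SD."""
    if pre.error_kg == 0:
        raise ValueError("pretest SD must be non-zero")
    return (post.mean_kg - pre.mean_kg) / pre.error_kg


# Hypothetical caliper readings in mm (not taken from any published graph).
kg_per_mm = scale_conversion(axis_max_kg=100.0, axis_mm=80.0)
pre = estimate_from_graph(kg_per_mm, error_top_mm=56.0, mean_mm=48.0)
post = estimate_from_graph(kg_per_mm, error_top_mm=64.0, mean_mm=56.0)
print(pre, post, effect_size(pre, post))
```

The same functions apply to line graphs without a break in the y-axis, since only the location of the mean measurement differs (center of the line rather than top of the bar).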


Figure 1. Hansen, Raastad, and Hallen (2007). [Bar graph; x-axis label: Training group.]


2.  The  line  graph  below  is  used  to  illustrate  the  following  steps  (Figure  2). 

Step 1. Measured and recorded mm on y-axis from the baseline (x-axis) to the
maximum  value  (kg)  (A). 

Step  2.  Created  scale  conversion  unit  by  dividing  the  kgs  on  the  y-axis  (40)  by 
the  y-axis  mm. 

Step  3.  Measured  and  recorded  the  total  mm  from  the  center  of  the  baseline 
(x-axis)  to  the  top  of  the  error  bars  (B). 

Step  4.  Measured  and  recorded  the  total  mm  from  the  center  of  the  baseline 
(x-axis) to the center of the line representing the mean (C).

Step  5.  Converted  to  kgs  using  the  conversion  unit  to  arrive  at  the  estimated 
mean. 


Step 6. Calculated the difference (absolute value) between the measurement
for the error bars and the measurement for the mean to get the value
representing the SD/error (B-C).

Step 7. Converted value from Step 6 to kgs using the conversion unit to arrive
at the estimated SD/error.

Step 8. Calculated effect size using estimated means and SDs.


[Figure 2. Line graph; y-axis: 1RM Bench Press (kg); x-axis: Week of Training.]


3. The broken line graph below is used to illustrate the following steps (Figure 3)
(elderly data)²:

Step 1. Recorded mm from the baseline to the beginning of the continuous
part of the scale (A).

Step 2. This distance became the line graph adjustment value.

Step 3. Created scale conversion unit by dividing the kgs (20) on the y-axis by
the y-axis mm (B).

Step 4. Took all subsequent measurements (means and SDs/errors) following
the same procedures as the line graphs without breaks in the y-axis.

Step 5. Once all data were recorded, subtracted the line graph adjustment value
from the unadjusted means and SD/error measurements to get the actual mean
and SD/error values. For example, to get the baseline value for the elderly,
mean = (C-A) and SD = (D-[C-A]).

Step 6. Converted values from Step 5 to kgs using the conversion unit to
arrive at the estimated mean and standard deviation/error.

Step 7. Calculated effect size using estimated means and SDs.


Figure 3. Hakkinen, Alen, Kallinen, Newton, and Kraemer (2000)

² The break in the y-axis is represented by * in Figure 3.


Two investigators each took two measurements, which yielded four estimates for
each effect size. These data were used to assess intra- and inter-rater reliability.

Statistical  data  analysis. 

Analysis  of  the  data  was  conducted  using  the  computer  program  SPSS-PC 
(Version 17). For study one, a reliability analysis was conducted to determine intra- and
inter-rater  correlations. 
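
As an illustration of this kind of reliability analysis, the sketch below computes pairwise Pearson correlations among four sets of effect size estimates (two raters, two trials each) and labels each pair as test-retest or inter-rater. It is not the SPSS procedure used in the study. The effect size values are made-up placeholders, and the names `pearson_r` and `estimates` are assumptions for the example.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical effect size estimates for five graphs (placeholder values only).
estimates = {
    ("rater 1", "trial 1"): [0.42, 0.95, 1.31, 2.10, 2.87],
    ("rater 1", "trial 2"): [0.44, 0.93, 1.35, 2.08, 2.90],
    ("rater 2", "trial 1"): [0.40, 0.97, 1.30, 2.13, 2.84],
    ("rater 2", "trial 2"): [0.43, 0.96, 1.28, 2.11, 2.88],
}

correlations = {}
for (key_a, a), (key_b, b) in combinations(estimates.items(), 2):
    # Same rater on two trials is test-retest; different raters is inter-rater.
    kind = "test-retest" if key_a[0] == key_b[0] else "inter-rater"
    correlations[(key_a, key_b)] = (kind, pearson_r(a, b))

for (key_a, key_b), (kind, r) in correlations.items():
    print(f"{key_a} vs {key_b}: {kind}, r = {r:.4f}")
```

With four sets of estimates there are six distinct pairs: two test-retest pairs and four inter-rater pairs, which correspond to the off-diagonal cells of a 4 × 4 correlation table.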

Results 

Study  one’s  means,  SDs,  and  intra-  and  inter-rater  correlations  for  the  effect  sizes 
are  presented  in  Table  1. 

Table 1

Intra- and Inter-Rater Correlations for Study One

                               Rater 1   Rater 1   Rater 2   Rater 2
Rater   Trial   Mean    SD     Trial 1   Trial 2   Trial 1   Trial 2

1       1       1.36    .95    1.000     .997      .996      .996
1       2       1.38    .97    .997      1.000     .993      .995
2       1       1.35    .96    .996      .993      1.000     .994
2       2       1.35    .90    .996      .995      .994      1.000


The results of study one show an average correlation of 0.997 between estimated
effect sizes. Correlations were higher for test-retest than for inter-rater
measurements, but all correlations were strong (>0.993). Furthermore, no difference was
seen between the reliability of bar graph and line graph measurements (both correlations
= 0.997).
Discussion

Study  one  demonstrated  that  using  electronic  calipers  to  take  measures  from 
graphs  of  the  effects  of  strength  training  programs  produced  results  that  were  highly 
reliable.  The  evidence  demonstrated  strong  test-retest  reliability  and  inter-rater  reliability. 
However,  it  was  still  unknown  how  accurate  these  graphic  effect  size  estimates  were 
when  compared  to  the  effect  size  estimates  that  would  have  been  obtained  if  the  tabular 
means  and  standard  deviations  for  the  effects  had  been  reported.  Study  two  addressed  the 
question  of  how  accurate  the  estimates  of  effect  sizes  are  when  derived  from  graph 
measurements. 
Study Two: Accuracy Testing


Methods 

Two  researchers  each  selected  a  set  of  20  paired  pretraining  and  posttraining  1RM 
measurements  at  random  from  a  set  of  728  effect  sizes  reported  in  181  articles  containing 
data  on  the  effects  of  resistance  training  on  maximal  strength.  The  means  and  SDs  for  the 
measurements  needed  to  construct  the  effect  sizes  were  extracted.  The  articles  that 
provided  those  statistics  are  indicated  in  the  reference  section. 

The  information  extracted  from  the  articles  was  used  to  construct  40  effect  size 
estimates  as  follows: 

a.  Each  researcher  used  Microsoft  Excel  to  create  20  bar  graphs  with  error  bars  from 
the  set  of  20  pretraining  and  posttraining  means  and  standard  deviations/errors 
that  he  or  she  had  selected. 

b.  Each  investigator  used  the  methods  described  in  Study  1  to  obtain  the 
measurements  needed  to  compute  effect  size  estimates  from  the  graphs  created  by 
the  other  investigator.  Those  measurements  were  taken  without  any  knowledge  of 
the  true  means  and  SDs/errors  that  had  been  used  to  create  the  graphs. 

c.  After  measurements  were  recorded,  effect  sizes  were  calculated  as  described  in 
study  one. 

A  data  file  was  constructed  that  included  the  effect  size  estimates  derived  from 
the  means  and  standard  deviations  reported  in  the  original  articles  (tabular  effect  sizes) 
and  the  effect  size  estimates  derived  from  the  graph  measurements  (graphic  effect  size 
estimates). The file contained 40 paired observations, 20 from each investigator.

Statistical  data  analysis. 

Paired t tests were conducted using SPSS-PC. The t tests compared effect sizes
computed from the reported means and standard deviations to the graph-based estimates.

Results 

Means, SDs, correlations, t values, and significance levels for study two's effect sizes are
reported in Table 2.
Table 2

Effect Sizes for Study Two

           Graphic        Tabular
           Mean   SD      Mean   SD     Effect Size³   Correlation   t      Sig

Rater 1    .80    .75     .79    .74    0.008          0.9998        3.26   .004
Rater 2    .96    .55     .95    .54    0.007          0.9996        2.18   .042


The correlations between the graphic and the tabular effect sizes were 0.9998 and
0.9996 for raters 1 and 2, respectively. The paired t tests indicated a statistically
significant difference between the graphic and tabular effect sizes for both raters
(rater 1, p = 0.004; rater 2, p = 0.042).

The statistically significant differences were due only to the high correlation of
the graphic effect sizes with the tabular effect sizes. Because the paired values were
almost perfectly correlated, the paired differences varied very little, so even a very
small mean difference produced a large t value. To demonstrate this, the differences
between the tabular and graphic means were converted to effect sizes using the following
equation: Effect Size = (Graphic Mean - Tabular Mean)/Tabular SD. The resulting effect
sizes (see Table 2) were 0.008 for rater 1 and 0.007 for rater 2.
These effect sizes would have to be more than 10 times larger to be considered
practically or theoretically important (Cohen, 1988).
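
The link between high correlation and statistical significance can be illustrated numerically. When paired values are almost perfectly correlated, the paired differences have a very small standard deviation, so the standard error of the mean difference is tiny and a negligible mean difference still yields a large t. The sketch below uses made-up values (not the study's data) and hypothetical helper names. It computes the paired t statistic by hand and the standardized difference from the equation given above, (Graphic Mean - Tabular Mean)/Tabular SD.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(a, b):
    """Paired t statistic: mean difference divided by its standard error."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


def standardized_difference(graphic, tabular):
    """(Graphic Mean - Tabular Mean) / Tabular SD, as in the text."""
    return (mean(graphic) - mean(tabular)) / stdev(tabular)


# Hypothetical effect sizes: graphic values sit about 0.01 above tabular ones.
tabular = [0.20, 0.45, 0.70, 0.95, 1.20, 1.45, 1.70, 1.95]
offsets = [0.012, 0.008, 0.011, 0.009, 0.010, 0.013, 0.007, 0.010]
graphic = [t + o for t, o in zip(tabular, offsets)]

t_value = paired_t(graphic, tabular)
d = standardized_difference(graphic, tabular)

# t is far above the two-tailed .05 critical value for 7 df (about 2.36),
# while d stays far below 0.2, Cohen's conventional threshold for a small effect.
print(f"t = {t_value:.2f}, standardized difference = {d:.4f}")
```

In this toy example the difference is statistically significant yet an order of magnitude below the smallest conventional effect size, which mirrors the pattern reported for both raters.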

Discussion 

Graphic  measurements  yield  accurate  effect  size  estimates.  As  a  consequence, 
effect  size  estimates  derived  from  graphs  can  be  combined  with  those  derived  from 
reported  means  and  SDs.  In  this  study,  the  effect  sizes  derived  by  the  two  methods  were 
almost perfectly correlated. In addition, the means and SDs were virtually identical. The
effect sizes derived from graphs were slightly larger than the tabular values (see Table 2), but the difference was
only .01 for each investigator. The difference was statistically significant in each case,
but  it  was  too  small  relative  to  the  SD  of  the  effect  sizes  to  be  of  theoretical  or  practical 
importance. 


General  Discussion  and  Conclusions 

Measurements  taken  from  graphs  can  provide  acceptable  effect  size  estimates. 

The  studies  reported  here  demonstrated  that  those  measurements  yield  reliable  and 
accurate estimates of effect sizes. Test-retest and inter-rater reliability were high, and the
differences between known effect sizes and estimated effect sizes were trivial.

The  results  of  these  studies  may  have  a  significant  limitation.  The  work  reported 
here  relied  on  electronic  calipers  with  digital  measurement  readouts.  Other  calipers  might 
produce  somewhat  lower  reliability  and  accuracy  if  the  raw  measurements  were  less 
reliable. 
Several issues arose when taking measurements of the graphic data. To assure
accuracy  in  extracting  data  from  graphs  using  electronic  calipers,  potential  errors  must  be 
avoided.  The  following  steps  should  be  taken  to  avoid  possible  sources  of  error: 

-  Close  calipers  completely  when  zeroing  out. 

-  Properly  account  for  breaks  in  the  scale  on  the  y-axis  in  line  graphs  (see  Figure  3). 

-  Make  sure  to  measure  from  the  numerical  values,  not  the  break  lines,  when 
working  with  breaks  on  the  y-axis  (see  Figure  3). 

- Be consistent in where the line measurements are being taken (e.g., in a bar graph,
if you place the bottom caliper point in the middle of the x-axis, place the top
caliper point in the middle of the line on the top of the bar). This is particularly
important if the lines are thick.

- Place the calipers on the actual baseline for all measurements in graphs where the
bars extend below the baseline.

Electronic  calipers  appear  to  be  a  reliable  and  accurate  tool  for  measuring  and 
converting graphs to estimates of effect sizes. This study looked at both test-retest and
inter-rater reliability and found high correlations for both. These findings demonstrate
a  novel  method  of  incorporating  a  more  comprehensive  selection  of  the  existing  research 
when  conducting  meta-analyses. 